    Online Targeted Learning

    We consider the case where the data come in sequentially and can be viewed as a sample of independent and identically distributed observations from a fixed data generating distribution. The goal is to estimate a particular pathwise differentiable target parameter of this data generating distribution, which is known to be an element of a particular semi-parametric statistical model. We want our estimator to be asymptotically efficient, but we also want it to be computable by updating the current estimate with each new block of data, without having to revisit past data, so that it is computationally much faster than recomputing a fixed estimator each time new data arrive. We refer to such an estimator as an online estimator. Online estimators can also be applied to a large fixed database by dividing the data set into many subsets and enforcing an ordering of these subsets. The current literature provides such online estimators for parametric models, where they are based on variations of the stochastic gradient descent algorithm. To obtain asymptotically efficient online estimation in semi-parametric models, we propose a new online one-step estimator, which is proven to be asymptotically efficient under regularity conditions. This estimator takes as input online estimators of the relevant part of the data generating distribution and of the nuisance parameter that are required for efficient estimation of the target parameter. These could be online stochastic gradient descent estimators based on large parametric models, as developed in the current literature, but we also propose other online data-adaptive estimators that do not rely on the specification of a particular parametric model. We also present a targeted version of this online one-step estimator that presumably minimizes the one-step correction and thereby might be more robust in finite samples. These online one-step estimators are not substitution estimators and might therefore be unstable in finite samples if the target parameter is borderline identifiable. Therefore we also develop an online targeted minimum loss-based estimator (TMLE), which updates the initial estimator of the relevant part of the data generating distribution with each new block of data and estimates the target parameter with the corresponding plug-in estimator. This online substitution estimator is also proven to be asymptotically efficient under the same regularity conditions required for asymptotic normality of the online one-step estimator. The online one-step estimator, the targeted online one-step estimator, and the online TMLE are demonstrated for estimation of a causal effect of a binary treatment on an outcome based on a dynamic database that is regularly updated, a common scenario for the analysis of electronic medical record databases. Finally, we extend these online estimators to a group sequential adaptive design in which certain components of the data generating experiment are continuously fine-tuned based on past data, and the new data generating distribution is then used to generate the next block of data.
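
    As a rough illustration of the block-wise update described above, the sketch below (Python, assuming scikit-learn and NumPy) uses SGD-based learners as stand-in online nuisance estimators for the propensity score and the outcome regression, and averages block-level one-step estimates (plug-in plus influence-curve correction) of the average effect of a binary treatment. The function and variable names are hypothetical; this is not the authors' implementation.

        # Hedged sketch: online one-step estimation of E[Y(1) - Y(0)] from
        # sequential data blocks, never revisiting past observations.
        import numpy as np
        from sklearn.linear_model import SGDClassifier, SGDRegressor

        def online_one_step(blocks, eps=1e-3):
            """blocks: iterable of (W, A, Y) arrays arriving sequentially."""
            g_model = SGDClassifier(loss="log_loss")  # online propensity score P(A=1 | W)
            Q_model = SGDRegressor()                  # online outcome regression E[Y | A, W]
            estimates, weights = [], []
            for j, (W, A, Y) in enumerate(blocks):
                if j > 0:  # evaluate the new block with nuisances fit to past blocks only
                    g = np.clip(g_model.predict_proba(W)[:, 1], eps, 1 - eps)
                    Q1 = Q_model.predict(np.column_stack([W, np.ones(len(A))]))
                    Q0 = Q_model.predict(np.column_stack([W, np.zeros(len(A))]))
                    QA = np.where(A == 1, Q1, Q0)
                    # plug-in term plus efficient-influence-curve correction on this block
                    psi_j = np.mean(Q1 - Q0 + (2 * A - 1) / np.where(A == 1, g, 1 - g) * (Y - QA))
                    estimates.append(psi_j)
                    weights.append(len(A))
                # update the nuisance estimators with the new block only
                g_model.partial_fit(W, A, classes=[0, 1])
                Q_model.partial_fit(np.column_stack([W, A]), Y)
            return np.average(estimates, weights=weights)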

    Identification and Efficient Estimation of the Natural Direct Effect Among the Untreated

    The natural direct effect (NDE), or the effect of an exposure on an outcome had an intermediate variable been set to the level it would have taken in the absence of the exposure, is often of interest to investigators. In general, the statistical parameter associated with the NDE is difficult to estimate in the non-parametric model, particularly when the intermediate variable is continuous or high dimensional. In this paper we introduce a new causal parameter called the natural direct effect among the untreated, discuss identifiability assumptions, and show that this new parameter is equivalent to the NDE in a randomized controlled trial. We also present a targeted minimum loss-based estimator (TMLE), a locally efficient, doubly robust substitution estimator for the statistical parameter associated with this causal parameter. The TMLE can be applied to problems with continuous and high dimensional intermediate variables, and can be used to estimate the NDE in a randomized controlled trial with such data. Additionally, we define and discuss the estimation of three related causal parameters: the natural direct effect among the treated, the indirect effect among the untreated, and the indirect effect among the treated.
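
    In counterfactual notation (a hedged reading, with A the binary exposure, Z the intermediate variable, Y the outcome, and Y_{a,z} the counterfactual outcome under exposure a with the intermediate set to z), the two parameters can be written as

        \[
        \mathrm{NDE} = E\bigl[Y_{1,Z_0} - Y_{0,Z_0}\bigr],
        \qquad
        \mathrm{NDE}_{A=0} = E\bigl[Y_{1,Z_0} - Y_{0,Z_0} \mid A = 0\bigr],
        \]

    where Z_0 is the level the intermediate would take in the absence of exposure. When A is randomized it is independent of the counterfactuals, so conditioning on A = 0 does not change the expectation, which is the sense in which the new parameter equals the NDE in a randomized controlled trial.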

    Application of a Variable Importance Measure Method to HIV-1 Sequence Data

    van der Laan (2005) proposed a method to construct variable importance measures along with the corresponding statistical inference. The technique assesses the importance of a variable in predicting an outcome, and can be implemented as an inverse probability of treatment weighted (IPTW) or double robust inverse probability of treatment weighted (DR-IPTW) estimator. The significance of the estimator is determined by estimating its influence curve and hence the corresponding variance and p-value. This article applies the van der Laan (2005) variable importance measures and corresponding inference to HIV-1 sequence data. In this data application, protease and reverse transcriptase codon positions on the HIV-1 strand are assessed to determine their respective variable importance with respect to an outcome of viral replication capacity. We estimate the W-adjusted variable importance measure for a specified set of potential effect modifiers W. Both the IPTW and DR-IPTW methods were implemented on this dataset.
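
    A minimal sketch of the IPTW version of such a variable importance estimate, with influence-curve based inference (treating the estimated propensity score as known, for simplicity), might look as follows. The names are illustrative and this is not the authors' code.

        # Hedged sketch: IPTW variable importance of a binary codon indicator A
        # for an outcome Y (e.g., replication capacity), adjusting for covariates W.
        import numpy as np
        from scipy import stats
        from sklearn.linear_model import LogisticRegression

        def iptw_importance(A, Y, W, eps=1e-3):
            g = np.clip(LogisticRegression().fit(W, A).predict_proba(W)[:, 1], eps, 1 - eps)
            ic = A * Y / g - (1 - A) * Y / (1 - g)            # per-observation IPTW contribution
            psi = ic.mean()                                    # variable importance estimate
            se = ic.std(ddof=1) / np.sqrt(len(Y))              # SE from the estimated influence curve
            p_value = 2 * stats.norm.sf(abs(psi / se))         # Wald-type p-value
            return psi, se, p_value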

    Local Environment of Ferromagnetically Ordered Mn in Epitaxial InMnAs

    The magnetic properties of the ferromagnetic semiconductor In0.98Mn0.02As were characterized by x-ray absorption spectroscopy and x-ray magnetic circular dichroism. The Mn exhibits an atomic-like L2,3 absorption spectrum, indicating that the 3d states are highly localized. In addition, a large dichroism at the Mn L2,3 edge was observed from 5-300 K in an applied field of 2 T. A calculated spectrum assuming atomic Mn2+ yields the best agreement with the experimental InMnAs spectrum. A comparison of the dichroism spectra of MnAs and InMnAs shows clear differences, suggesting that the ferromagnetism observed in InMnAs is not due to hexagonal MnAs clusters. The temperature dependence of the dichroism indicates the presence of two ferromagnetic species, one with a transition temperature of 30 K and another with a transition temperature in excess of 300 K. The dichroism spectra are consistent with the assignment of the low temperature species to random substitutional Mn and the high temperature species to Mn near-neighbor pairs. (Comment: 10 pages, 4 figures, accepted by Applied Physics Letters.)

    Multiple Testing Procedures for Controlling Tail Probability Error Rates

    The present article discusses and compares multiple testing procedures (MTPs) for controlling Type I error rates defined as tail probabilities for the number (gFWER) and proportion (TPPFP) of false positives among the rejected hypotheses. Specifically, we consider the gFWER- and TPPFP-controlling MTPs proposed recently by Lehmann & Romano (2004) and in a series of four articles by Dudoit et al. (2004), van der Laan et al. (2004b,a), and Pollard & van der Laan (2004). The former Lehmann & Romano (2004) procedures are marginal, in the sense that they are based solely on the marginal distributions of the test statistics, i.e., on cut-off rules for the corresponding unadjusted p-values. In contrast, the procedures discussed in our previous articles take into account the joint distribution of the test statistics and apply to general data generating distributions, i.e., dependence structures among test statistics. The gFWER-controlling common-cut-off and common-quantile procedures of Dudoit et al. (2004) and Pollard & van der Laan (2004) are based on the distributions of maxima of test statistics and minima of unadjusted p-values, respectively. For a suitably chosen initial FWER-controlling procedure, the gFWER- and TPPFP-controlling augmentation multiple testing procedures (AMTP) of van der Laan et al. (2004a) can also take into account the joint distribution of the test statistics. Given a gFWER-controlling procedure, we also propose AMTPs for controlling tail probability error rates, Pr(g(V_n, R_n) > q), for arbitrary functions g(V_n, R_n) of the numbers of false positives V_n and rejected hypotheses R_n. The different gFWER- and TPPFP-controlling procedures are compared in a simulation study, where the tests concern the components of the mean vector of a multivariate Gaussian data generating distribution. Among notable findings are the substantial power gains achieved by joint procedures compared to marginal procedures.
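
    The augmentation idea can be sketched in a few lines: starting from the rejections of an initial FWER-controlling procedure, reject the next most significant hypotheses, with the number of additions chosen to control gFWER(k) or TPPFP(q). The Python sketch below is illustrative only, with hypothetical names.

        # Hedged sketch of an augmentation multiple testing procedure (AMTP).
        import numpy as np

        def augment(adj_pvalues, alpha=0.05, k=0, q=None):
            """adj_pvalues: FWER-adjusted p-values from an initial procedure."""
            order = np.argsort(adj_pvalues)
            r0 = int(np.sum(adj_pvalues <= alpha))       # initial FWER-controlling rejections
            if q is None:
                extra = k                                 # gFWER(k): tolerate up to k more false positives
            else:
                extra = int(np.floor(q * r0 / (1 - q)))   # TPPFP(q): keep added fraction of rejections <= q
            return order[: r0 + extra]                    # indices of rejected hypotheses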

    Balancing Score Adjusted Targeted Minimum Loss-based Estimation

    Adjusting for a balancing score is sufficient for bias reduction when estimating causal effects, including the average treatment effect and the effect among the treated. Estimators that adjust for the propensity score in a nonparametric way, such as matching on an estimate of the propensity score, can be consistent when the estimated propensity score is not consistent for the true propensity score but converges to some other balancing score. We call this property the balancing score property, and discuss a class of estimators that have this property. We introduce a targeted minimum loss-based estimator (TMLE) for a treatment specific mean with the balancing score property that is additionally locally efficient and doubly robust. We investigate the new estimator's performance relative to other estimators, including another TMLE, a propensity score matching estimator, an inverse probability of treatment weighted estimator, and a regression-based estimator, in simulation studies.
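
    For contrast with the estimators compared in the simulations, a generic (not balancing-score adjusted) TMLE of the treatment-specific mean E[Y(1)] can be sketched as below, assuming a binary outcome. The clever-covariate fluctuation and plug-in step follow the standard TMLE recipe; all names are illustrative and this is not the authors' code.

        # Hedged sketch of a standard TMLE for E[Y(1)]: initial fits, logistic
        # fluctuation along the clever covariate A/g(W), then the plug-in mean.
        import numpy as np
        import statsmodels.api as sm
        from scipy.special import logit, expit
        from sklearn.linear_model import LogisticRegression

        def tmle_treated_mean(W, A, Y, eps=1e-3):
            g = np.clip(LogisticRegression().fit(W, A).predict_proba(W)[:, 1], eps, 1 - eps)
            # initial outcome regression fit among the treated, evaluated for everyone
            Q1 = np.clip(LogisticRegression().fit(W[A == 1], Y[A == 1])
                         .predict_proba(W)[:, 1], eps, 1 - eps)
            H = (A / g).reshape(-1, 1)                               # clever covariate
            flu = sm.GLM(Y, H, family=sm.families.Binomial(), offset=logit(Q1)).fit()
            Q1_star = expit(logit(Q1) + flu.params[0] / g)           # targeted (fluctuated) fit
            return Q1_star.mean()                                    # substitution estimate of E[Y(1)]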

    Semiparametric theory and empirical processes in causal inference

    In this paper we review important aspects of semiparametric theory and empirical processes that arise in causal inference problems. We begin with a brief introduction to the general problem of causal inference, and go on to discuss estimation and inference for causal effects under semiparametric models, which allow parts of the data-generating process to be unrestricted if they are not of particular interest (i.e., nuisance functions). These models are very useful in causal problems because the outcome process is often complex and difficult to model, and there may only be information available about the treatment process (at best). Semiparametric theory gives a framework for benchmarking efficiency and constructing estimators in such settings. In the second part of the paper we discuss empirical process theory, which provides powerful tools for understanding the asymptotic behavior of semiparametric estimators that depend on flexible nonparametric estimators of nuisance functions. These tools are crucial for incorporating machine learning and other modern methods into causal inference analyses. We conclude by examining related extensions and future directions for work in semiparametric causal inference.
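
    A standard way to make the role of empirical process theory concrete (a generic decomposition, not quoted from the paper) is to expand a one-step or targeted estimator \hat\psi built from estimated nuisance functions \hat\eta around the truth:

        \[
        \hat\psi - \psi
        = (\mathbb{P}_n - P)\,\varphi(\cdot;\eta)
        + (\mathbb{P}_n - P)\bigl\{\varphi(\cdot;\hat\eta) - \varphi(\cdot;\eta)\bigr\}
        + R_2(\hat\eta, \eta),
        \]

    where \varphi denotes the efficient influence function. The first term drives the efficient, asymptotically normal limit; the second, empirical-process term is handled with Donsker conditions or sample splitting; and the second-order remainder R_2 is negligible when the nuisance estimators converge fast enough (for example, when a product of their convergence rates is o_P(n^{-1/2})).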

    Spin and orbital moments of ultra-thin Fe films on various semiconductor surfaces

    The magnetic moments of ultrathin Fe films on three different III-V semiconductor substrates, namely GaAs, InAs and In0.2Ga0.8As, have been measured with X-ray magnetic circular dichroism at room temperature to assess their relative merits as combinations suitable for next-generation spintronic devices. The results revealed rather similar spin and orbital moments for the three systems, suggesting that the relationship between the film and semiconductor lattice parameters is less critical to the magnetic moments than to the magnetic anisotropy.

    Multiple Testing and Data Adaptive Regression: An Application to HIV-1 Sequence Data

    Analysis of viral strand sequence data and viral replication capacity could potentially lead to biological insights regarding the replication ability of HIV-1. Determining specific target codons on the viral strand will facilitate the manufacturing of target-specific antiretrovirals. Various algorithmic and analysis techniques can be applied to this application. We propose using multiple testing to find codons that have significant univariate associations with the replication capacity of the virus. We also propose using a data adaptive multiple regression algorithm to obtain multiple predictions of viral replication capacity based on an entire mutant/non-mutant sequence profile. The data set to which these techniques were applied consists of 317 patients, each with 282 sequenced protease and reverse transcriptase codons. Initially, the multiple testing procedure (Pollard and van der Laan, 2003) was applied to the individual specific viral sequence data. A single-step multiple testing procedure was used to control the family-wise error rate (FWER) at the five percent alpha level. Additional augmentation multiple testing procedures were applied to control the generalized family-wise error rate (gFWER) or the tail probability of the proportion of false positives (TPPFP). Finally, the loss-based, cross-validated Deletion/Substitution/Addition regression algorithm (Sinisi and van der Laan, 2004) was applied to the dataset separately. This algorithm builds candidate estimators for the prediction of a univariate outcome by minimizing an empirical risk, and it uses cross-validation to select fine-tuning parameters such as the size of the regression model, the maximum allowed order of interaction of terms in the regression model, and the dimension of the vector of covariates. The algorithm is also used to measure the variable importance of the codons. Findings from these multiple analyses are consistent with biological findings and could possibly lead to further biological knowledge regarding HIV-1 viral data.
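
    The single-step FWER-controlling step can be pictured as a common-cutoff (maxT-type) rule based on a resampled null distribution of the test statistics. The Python sketch below is illustrative only, with null_stats standing in for hypothetical bootstrap or permutation draws under the null.

        # Hedged sketch: single-step common-cutoff procedure controlling the FWER.
        import numpy as np

        def single_step_maxT(test_stats, null_stats, alpha=0.05):
            """test_stats: (m,) observed statistics; null_stats: (B, m) null draws."""
            max_null = np.abs(null_stats).max(axis=1)            # maximum statistic in each null draw
            cutoff = np.quantile(max_null, 1 - alpha)            # common cutoff at level alpha
            adj_p = (max_null[:, None] >= np.abs(test_stats)[None, :]).mean(axis=0)
            return np.abs(test_stats) > cutoff, adj_p            # rejections and adjusted p-values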

    Resampling Based Multiple Testing Procedure Controlling Tail Probability of the Proportion of False Positives

    Simultaneously testing a collection of null hypotheses about a data generating distribution based on a sample of independent and identically distributed observations is a fundamental and important statistical problem with many applications. In this article we propose a new resampling based multiple testing procedure asymptotically controlling the probability that the proportion of false positives among the set of rejections exceeds q at level alpha, where q and alpha are user-supplied numbers. The procedure involves 1) specifying a conditional distribution for a guessed set of true null hypotheses, given the data, which asymptotically is degenerate at the true set of null hypotheses, and 2) specifying a generally valid null distribution for the vector of test statistics proposed in Pollard and van der Laan (2003) and generalized in our subsequent articles Dudoit et al. (2004), van der Laan et al. (2004a) and van der Laan et al. (2004b). We establish the finite sample rationale behind our proposal, and prove that this new multiple testing procedure asymptotically controls the desired tail probability for the proportion of false positives under general data generating distributions. In addition, we provide simulation studies establishing that this method is generally more powerful in finite samples than our previously proposed augmentation multiple testing procedure (van der Laan et al. (2004b)) and competing procedures from the literature. Finally, we illustrate our methodology with a data analysis.
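
    One loose way to picture the construction (an interpretation of the two ingredients above, not the paper's exact algorithm) is to relax a common cutoff on the test statistics as long as the resampling-estimated probability that the proportion of false positives exceeds q stays below alpha, counting false positives among a guessed set of true nulls. All names in the Python sketch below are hypothetical.

        # Hedged, simplified sketch: choose the most liberal common cutoff whose
        # estimated Pr(V/R > q) stays below alpha.  null_stats: (B, m) draws from the
        # null distribution of the test statistics; is_null_guess: guessed true nulls.
        import numpy as np

        def tppfp_rejections(test_stats, null_stats, is_null_guess, q=0.1, alpha=0.05):
            t = np.abs(test_stats)
            best = np.inf                                 # start with no rejections
            for c in np.sort(t)[::-1]:                    # relax the cutoff one statistic at a time
                R = np.sum(t >= c)                        # rejections among observed statistics
                V = (np.abs(null_stats[:, is_null_guess]) >= c).sum(axis=1)  # false positives per draw
                if np.mean(V / R > q) > alpha:            # estimated tail probability too large: stop
                    break
                best = c
            return t >= best                              # boolean vector of rejected hypotheses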